#COMMENT IF NOT USING COLAB VM
# This mounts your Google Drive to the Colab VM.
from google.colab import drive
drive.mount('/content/drive')
# TODO: Enter the foldername in your Drive where you have saved the unzipped
# assignment folder, e.g. 'DeepLearning/assignments/assignment5/'
FOLDERNAME = "CS6353/Assignments/assignment5/assignment5/"
assert FOLDERNAME is not None, "[!] Enter the foldername."
# Now that we've mounted your Drive, this ensures that
# the Python interpreter of the Colab VM can load
# python files from within it.
import sys
sys.path.append('/content/drive/My Drive/{}'.format(FOLDERNAME))
# This downloads the assignment datasets (COCO captioning data, SqueezeNet
# checkpoint, ImageNet validation subset) to your Drive if they don't already exist.
%cd /content/drive/My\ Drive/$FOLDERNAME/cs6353/datasets/
!bash get_datasets.sh
%cd /content/drive/My\ Drive/$FOLDERNAME
Mounted at /content/drive
/content/drive/My Drive/CS6353/Assignments/assignment5/assignment5/cs6353/datasets
--2024-11-28 04:16:02-- http://supermoe.cs.umass.edu/682/asgns/coco_captioning.zip
Resolving supermoe.cs.umass.edu (supermoe.cs.umass.edu)... 128.119.244.95
Connecting to supermoe.cs.umass.edu (supermoe.cs.umass.edu)|128.119.244.95|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1035210391 (987M) [application/zip]
Saving to: ‘coco_captioning.zip’
coco_captioning.zip 100%[===================>] 987.25M 8.59MB/s in 5m 8s
2024-11-28 04:21:10 (3.21 MB/s) - ‘coco_captioning.zip’ saved [1035210391/1035210391]
Archive: coco_captioning.zip
replace coco_captioning/coco2014_captions.h5? [y]es, [n]o, [A]ll, [N]one, [r]ename: A
inflating: coco_captioning/coco2014_captions.h5
inflating: coco_captioning/coco2014_vocab.json
inflating: coco_captioning/train2014_images.txt
inflating: coco_captioning/train2014_urls.txt
inflating: coco_captioning/train2014_vgg16_fc7.h5
inflating: coco_captioning/train2014_vgg16_fc7_pca.h5
inflating: coco_captioning/val2014_images.txt
inflating: coco_captioning/val2014_urls.txt
inflating: coco_captioning/val2014_vgg16_fc7.h5
inflating: coco_captioning/val2014_vgg16_fc7_pca.h5
--2024-11-28 04:23:12-- http://supermoe.cs.umass.edu/682/asgns/squeezenet_tf.zip
Resolving supermoe.cs.umass.edu (supermoe.cs.umass.edu)... 128.119.244.95
Connecting to supermoe.cs.umass.edu (supermoe.cs.umass.edu)|128.119.244.95|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 9202140 (8.8M) [application/zip]
Saving to: ‘squeezenet_tf.zip’
squeezenet_tf.zip 100%[===================>] 8.78M 2.46MB/s in 4.3s
2024-11-28 04:23:16 (2.03 MB/s) - ‘squeezenet_tf.zip’ saved [9202140/9202140]
Archive: squeezenet_tf.zip
replace squeezenet.ckpt.data-00000-of-00001? [y]es, [n]o, [A]ll, [N]one, [r]ename: A
inflating: squeezenet.ckpt.data-00000-of-00001
inflating: squeezenet.ckpt.index
inflating: squeezenet.ckpt.meta
--2024-11-28 04:26:30-- http://supermoe.cs.umass.edu/682/asgns/imagenet_val_25.npz
Resolving supermoe.cs.umass.edu (supermoe.cs.umass.edu)... 128.119.244.95
Connecting to supermoe.cs.umass.edu (supermoe.cs.umass.edu)|128.119.244.95|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3940548 (3.8M)
Saving to: ‘imagenet_val_25.npz.6’
imagenet_val_25.npz 100%[===================>] 3.76M 2.05MB/s in 1.8s
2024-11-28 04:26:32 (2.05 MB/s) - ‘imagenet_val_25.npz.6’ saved [3940548/3940548]
/content/drive/My Drive/CS6353/Assignments/assignment5/assignment5
# #UNCOMMENT IF USING CADE
# import os
# ##### Request a GPU #####
# ## This function locates an available GPU. In addition, it reserves a specified
# ## amount of memory exclusively for your account. The memory reservation prevents a drop in computational
# ## speed when other users try to allocate memory on the same GPU in shared systems, i.e., CADE machines.
# ## Note: If you use your own system which has a GPU with less than 4GB of memory, remember to change the
# ## specified minimum memory.
# def define_gpu_to_use(minimum_memory_mb = 3500):
#     thres_memory = 600 #
#     gpu_to_use = None
#     try:
#         os.environ['CUDA_VISIBLE_DEVICES']
#         print('GPU already assigned before: ' + str(os.environ['CUDA_VISIBLE_DEVICES']))
#         return
#     except:
#         pass
#     for i in range(16):
#         free_memory = !nvidia-smi --query-gpu=memory.free -i $i --format=csv,nounits,noheader
#         if free_memory[0] == 'No devices were found':
#             break
#         free_memory = int(free_memory[0])
#         if free_memory>minimum_memory_mb-thres_memory:
#             gpu_to_use = i
#             break
#     if gpu_to_use is None:
#         print('Could not find any GPU available with the required free memory of ' + str(minimum_memory_mb) \
#               + 'MB. Please use a different system for this assignment.')
#     else:
#         os.environ['CUDA_VISIBLE_DEVICES'] = str(gpu_to_use)
#         print('Chosen GPU: ' + str(gpu_to_use))
# ## Request a gpu and reserve the memory space
# define_gpu_to_use(4000)
Image Captioning with RNNs
In this exercise you will implement a vanilla recurrent neural network and use it to train a model that can generate novel captions for images.
# As usual, a bit of setup
import time, os, json
import numpy as np
import matplotlib.pyplot as plt
from cs6353.gradient_check import eval_numerical_gradient, eval_numerical_gradient_array
from cs6353.rnn_layers import *
from cs6353.captioning_solver import CaptioningSolver
from cs6353.classifiers.rnn import CaptioningRNN
from cs6353.coco_utils import load_coco_data, sample_coco_minibatch, decode_captions
from cs6353.image_utils import image_from_url
%matplotlib inline
plt.rcParams['figure.figsize'] = (10.0, 8.0) # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'
# for auto-reloading external modules
# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython
%load_ext autoreload
%autoreload 2
def rel_error(x, y):
    """ returns relative error """
    return np.max(np.abs(x - y) / (np.maximum(1e-8, np.abs(x) + np.abs(y))))
Install h5py
The COCO dataset we will be using is stored in HDF5 format. To load HDF5 files, we will need to install the h5py Python package. Check if h5py is already installed:
import h5py
If the module is not found, you will need to install it now. From the command line, run:
pip install h5py
If you receive a permissions error, you may need to run the command as root:
sudo pip install h5py
You can also run commands directly from the Jupyter notebook by prefixing the command with the "!" character:
!pip install h5py
Requirement already satisfied: h5py in /usr/local/lib/python3.10/dist-packages (3.12.1)
Requirement already satisfied: numpy>=1.19.3 in /usr/local/lib/python3.10/dist-packages (from h5py) (1.26.4)
Microsoft COCO
For this exercise we will use the 2014 release of the Microsoft COCO dataset which has become the standard testbed for image captioning. The dataset consists of 80,000 training images and 40,000 validation images, each annotated with 5 captions written by workers on Amazon Mechanical Turk.
You should have already downloaded the data by changing to the cs6353/datasets directory and running the script get_datasets.sh, as the setup cell at the top of this notebook does. If you haven't yet done so, run that script now. Warning: the COCO data download is ~1GB.
We have preprocessed the data and extracted features for you already. For all images we have extracted features from the fc7 layer of the VGG-16 network pretrained on ImageNet; these features are stored in the files train2014_vgg16_fc7.h5 and val2014_vgg16_fc7.h5 respectively. To cut down on processing time and memory requirements, we have reduced the dimensionality of the features from 4096 to 512; these features can be found in the files train2014_vgg16_fc7_pca.h5 and val2014_vgg16_fc7_pca.h5.
The raw images take up a lot of space (nearly 20GB) so we have not included them in the download. However all images are taken from Flickr, and URLs of the training and validation images are stored in the files train2014_urls.txt and val2014_urls.txt respectively. This allows you to download images on the fly for visualization. Since images are downloaded on-the-fly, you must be connected to the internet to view images.
Dealing with strings is inefficient, so we will work with an encoded version of the captions. Each word is assigned an integer ID, allowing us to represent a caption by a sequence of integers. The mapping between integer IDs and words is in the file coco2014_vocab.json, and you can use the function decode_captions from the file cs6353/coco_utils.py to convert numpy arrays of integer IDs back into strings.
There are a few special tokens that we add to the vocabulary. We prepend a special <START> token to the beginning of each caption and append an <END> token to the end. Rare words are replaced with a special <UNK> token (for "unknown"). In addition, since we want to train with minibatches containing captions of different lengths, we pad short captions with a special <NULL> token after the <END> token and don't compute loss or gradient for <NULL> tokens. Since they are a bit of a pain, we have taken care of all implementation details around special tokens for you.
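To make the encoding concrete, here is a minimal sketch of how a caption could be turned into a padded integer sequence. The toy vocabulary, its index assignments, and the helper name `encode_caption` are assumptions made for illustration; the real mapping is the one stored in coco2014_vocab.json, and the course code handles all of this for you.

```python
import numpy as np

# Hypothetical toy vocabulary; the real one is loaded from coco2014_vocab.json.
word_to_idx = {'<NULL>': 0, '<START>': 1, '<END>': 2, '<UNK>': 3,
               'a': 4, 'cat': 5, 'on': 6, 'bed': 7}
idx_to_word = {i: w for w, i in word_to_idx.items()}

def encode_caption(text, word_to_idx, max_len):
    """Hypothetical helper: wrap in <START>/<END>, map rare words to <UNK>, pad with <NULL>."""
    unk = word_to_idx['<UNK>']
    ids = [word_to_idx['<START>']]
    ids += [word_to_idx.get(w, unk) for w in text.lower().split()]
    ids.append(word_to_idx['<END>'])
    # Pad short captions so every caption in a minibatch has the same length.
    ids += [word_to_idx['<NULL>']] * (max_len - len(ids))
    return np.array(ids, dtype=np.int32)

encoded = encode_caption('a cat on a zebra', word_to_idx, max_len=10)
mask = encoded != word_to_idx['<NULL>']  # positions that should count toward the loss
print(encoded)
print(' '.join(idx_to_word[i] for i in encoded[mask]))
```

The word "zebra" is not in the toy vocabulary, so it is encoded as <UNK>; the boolean mask marks the non-padding positions.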
You can load all of the MS-COCO data (captions, features, URLs, and vocabulary) using the load_coco_data function from the file cs6353/coco_utils.py. Run the following cell to do so:
# Load COCO data from disk; this returns a dictionary
# We'll work with dimensionality-reduced features for this notebook, but feel
# free to experiment with the original features by changing the flag below.
data = load_coco_data(pca_features=True)
# Print out all the keys and values from the data dictionary
for k, v in data.items():
    if isinstance(v, np.ndarray):
        print(k, type(v), v.shape, v.dtype)
    else:
        print(k, type(v), len(v))
train_captions <class 'numpy.ndarray'> (400135, 17) int32
train_image_idxs <class 'numpy.ndarray'> (400135,) int32
val_captions <class 'numpy.ndarray'> (195954, 17) int32
val_image_idxs <class 'numpy.ndarray'> (195954,) int32
train_features <class 'numpy.ndarray'> (82783, 512) float32
val_features <class 'numpy.ndarray'> (40504, 512) float32
idx_to_word <class 'list'> 1004
word_to_idx <class 'dict'> 1004
train_urls <class 'numpy.ndarray'> (82783,) <U63
val_urls <class 'numpy.ndarray'> (40504,) <U63
Look at the data
It is always a good idea to look at examples from the dataset before working with it.
You can use the sample_coco_minibatch function from the file cs6353/coco_utils.py to sample minibatches of data from the data structure returned from load_coco_data. Run the following to sample a small minibatch of training data and show the images and their captions. Running it multiple times and looking at the results helps you to get a sense of the dataset.
Note that we decode the captions using the decode_captions function and that we download the images on-the-fly using their Flickr URL, so you must be connected to the internet to view images.
# Sample a minibatch and show the images and captions
batch_size = 3
captions, features, urls = sample_coco_minibatch(data, batch_size=batch_size)
for i, (caption, url) in enumerate(zip(captions, urls)):
    plt.imshow(image_from_url(url))
    plt.axis('off')
    caption_str = decode_captions(caption, data['idx_to_word'])
    plt.title(caption_str)
    plt.show()
Recurrent Neural Networks
As discussed in lecture, we will use recurrent neural network (RNN) language models for image captioning. The file cs6353/rnn_layers.py contains implementations of different layer types that are needed for recurrent neural networks, and the file cs6353/classifiers/rnn.py uses these layers to implement an image captioning model.
We will first implement different types of RNN layers in cs6353/rnn_layers.py.
Vanilla RNN: step forward
Open the file cs6353/rnn_layers.py. This file implements the forward and backward passes for different types of layers that are commonly used in recurrent neural networks.
First implement the function rnn_step_forward which implements the forward pass for a single timestep of a vanilla recurrent neural network. After doing so run the following to check your implementation. You should see errors on the order of e-8 or less.
N, D, H = 3, 10, 4
x = np.linspace(-0.4, 0.7, num=N*D).reshape(N, D)
prev_h = np.linspace(-0.2, 0.5, num=N*H).reshape(N, H)
Wx = np.linspace(-0.1, 0.9, num=D*H).reshape(D, H)
Wh = np.linspace(-0.3, 0.7, num=H*H).reshape(H, H)
b = np.linspace(-0.2, 0.4, num=H)
next_h, _ = rnn_step_forward(x, prev_h, Wx, Wh, b)
expected_next_h = np.asarray([
[-0.58172089, -0.50182032, -0.41232771, -0.31410098],
[ 0.66854692, 0.79562378, 0.87755553, 0.92795967],
[ 0.97934501, 0.99144213, 0.99646691, 0.99854353]])
print('next_h error: ', rel_error(expected_next_h, next_h))
next_h error: 6.292421426471037e-09
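For reference, a single vanilla RNN timestep is commonly written as next_h = tanh(x Wx + prev_h Wh + b). The sketch below is one possible way to express that update in NumPy, assuming this is the recurrence the assignment intends; the name `rnn_step_forward_sketch` and the cache layout are illustrative choices, not the course's reference solution.

```python
import numpy as np

def rnn_step_forward_sketch(x, prev_h, Wx, Wh, b):
    """One vanilla RNN timestep: next_h = tanh(x Wx + prev_h Wh + b)."""
    next_h = np.tanh(x @ Wx + prev_h @ Wh + b)
    cache = (x, prev_h, Wx, Wh, next_h)  # values a backward pass would need
    return next_h, cache

# Same inputs as the check cell above.
N, D, H = 3, 10, 4
x = np.linspace(-0.4, 0.7, num=N * D).reshape(N, D)
prev_h = np.linspace(-0.2, 0.5, num=N * H).reshape(N, H)
Wx = np.linspace(-0.1, 0.9, num=D * H).reshape(D, H)
Wh = np.linspace(-0.3, 0.7, num=H * H).reshape(H, H)
b = np.linspace(-0.2, 0.4, num=H)

next_h, _ = rnn_step_forward_sketch(x, prev_h, Wx, Wh, b)
print(next_h.shape)  # one hidden vector of size H per example
```

Because tanh squashes its input, every entry of next_h lies strictly between -1 and 1, which matches the range of the expected values in the check.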
Vanilla RNN: step backward
In the file cs6353/rnn_layers.py implement the rnn_step_backward function. After doing so run the following to numerically gradient check your implementation. You should see errors on the order of e-8 or less.
from cs6353.rnn_layers import rnn_step_forward, rnn_step_backward
np.random.seed(231)
N, D, H = 4, 5, 6
x = np.random.randn(N, D)
h = np.random.randn(N, H)
Wx = np.random.randn(D, H)
Wh = np.random.randn(H, H)
b = np.random.randn(H)
out, cache = rnn_step_forward(x, h, Wx, Wh, b)
dnext_h = np.random.randn(*out.shape)
fx = lambda x: rnn_step_forward(x, h, Wx, Wh, b)[0]
fh = lambda prev_h: rnn_step_forward(x, prev_h, Wx, Wh, b)[0]
fWx = lambda Wx: rnn_step_forward(x, h, Wx, Wh, b)[0]
fWh = lambda Wh: rnn_step_forward(x, h, Wx, Wh, b)[0]
fb = lambda b: rnn_step_forward(x, h, Wx, Wh, b)[0]
dx_num = eval_numerical_gradient_array(fx, x, dnext_h)
dprev_h_num = eval_numerical_gradient_array(fh, h, dnext_h)
dWx_num = eval_numerical_gradient_array(fWx, Wx, dnext_h)
dWh_num = eval_numerical_gradient_array(fWh, Wh, dnext_h)
db_num = eval_numerical_gradient_array(fb, b, dnext_h)
dx, dprev_h, dWx, dWh, db = rnn_step_backward(dnext_h, cache)
print('dx error: ', rel_error(dx_num, dx))
print('dprev_h error: ', rel_error(dprev_h_num, dprev_h))
print('dWx error: ', rel_error(dWx_num, dWx))
print('dWh error: ', rel_error(dWh_num, dWh))
print('db error: ', rel_error(db_num, db))
dx error: 2.7795541640745535e-10
dprev_h error: 2.732467428030486e-10
dWx error: 9.709219069305414e-10
dWh error: 5.034262638717296e-10
db error: 1.708752322503098e-11
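The backward pass of a single step follows from the chain rule through tanh, whose derivative is 1 - tanh(a)^2. The sketch below assumes the recurrence next_h = tanh(x Wx + prev_h Wh + b) and checks two of the analytic gradients against central differences. The function names and the small `numeric_grad` helper are hypothetical stand-ins; the notebook itself uses eval_numerical_gradient_array from cs6353/gradient_check.py.

```python
import numpy as np

def step_forward(x, prev_h, Wx, Wh, b):
    next_h = np.tanh(x @ Wx + prev_h @ Wh + b)
    return next_h, (x, prev_h, Wx, Wh, next_h)

def step_backward(dnext_h, cache):
    x, prev_h, Wx, Wh, next_h = cache
    da = dnext_h * (1.0 - next_h ** 2)  # gradient w.r.t. the pre-activation
    dx = da @ Wx.T
    dprev_h = da @ Wh.T
    dWx = x.T @ da
    dWh = prev_h.T @ da
    db = da.sum(axis=0)
    return dx, dprev_h, dWx, dWh, db

def numeric_grad(f, a, dout, eps=1e-5):
    """Central-difference gradient of sum(f() * dout) w.r.t. array a (perturbed in place)."""
    grad = np.zeros_like(a)
    for ix in np.ndindex(*a.shape):
        old = a[ix]
        a[ix] = old + eps
        pos = f()
        a[ix] = old - eps
        neg = f()
        a[ix] = old  # restore the original value
        grad[ix] = np.sum((pos - neg) * dout) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
N, D, H = 4, 5, 6
x, h = rng.standard_normal((N, D)), rng.standard_normal((N, H))
Wx, Wh, b = rng.standard_normal((D, H)), rng.standard_normal((H, H)), rng.standard_normal(H)

out, cache = step_forward(x, h, Wx, Wh, b)
dnext_h = rng.standard_normal(out.shape)
dx, dprev_h, dWx, dWh, db = step_backward(dnext_h, cache)

f = lambda: step_forward(x, h, Wx, Wh, b)[0]
max_diff = max(np.abs(dx - numeric_grad(f, x, dnext_h)).max(),
               np.abs(dWh - numeric_grad(f, Wh, dnext_h)).max())
print(max_diff)
```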
Vanilla RNN: forward
Now that you have implemented the forward and backward passes for a single timestep of a vanilla RNN, you will combine these pieces to implement an RNN that processes an entire sequence of data.
In the file cs6353/rnn_layers.py, implement the function rnn_forward. This should be implemented using the rnn_step_forward function that you defined above. After doing so run the following to check your implementation. You should see errors on the order of e-7 or less.
N, T, D, H = 2, 3, 4, 5
x = np.linspace(-0.1, 0.3, num=N*T*D).reshape(N, T, D)
h0 = np.linspace(-0.3, 0.1, num=N*H).reshape(N, H)
Wx = np.linspace(-0.2, 0.4, num=D*H).reshape(D, H)
Wh = np.linspace(-0.4, 0.1, num=H*H).reshape(H, H)
b = np.linspace(-0.7, 0.1, num=H)
h, _ = rnn_forward(x, h0, Wx, Wh, b)
expected_h = np.asarray([
[
[-0.42070749, -0.27279261, -0.11074945, 0.05740409, 0.22236251],
[-0.39525808, -0.22554661, -0.0409454, 0.14649412, 0.32397316],
[-0.42305111, -0.24223728, -0.04287027, 0.15997045, 0.35014525],
],
[
[-0.55857474, -0.39065825, -0.19198182, 0.02378408, 0.23735671],
[-0.27150199, -0.07088804, 0.13562939, 0.33099728, 0.50158768],
[-0.51014825, -0.30524429, -0.06755202, 0.17806392, 0.40333043]]])
print('h error: ', rel_error(expected_h, h))
h error: 7.728466151011529e-08
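Processing a whole sequence amounts to applying the single-step update once per timestep and feeding each hidden state into the next step. The following is a minimal sketch under the assumption that the step is next_h = tanh(x_t Wx + prev_h Wh + b); `rnn_forward_sketch` is a hypothetical name and, unlike the assignment's rnn_forward, it does not return a cache.

```python
import numpy as np

def rnn_forward_sketch(x, h0, Wx, Wh, b):
    """Run a vanilla RNN over inputs x of shape (N, T, D); returns hidden states (N, T, H)."""
    N, T, D = x.shape
    H = h0.shape[1]
    h = np.zeros((N, T, H))
    prev_h = h0
    for t in range(T):
        # The hidden state from step t-1 is the recurrent input at step t.
        prev_h = np.tanh(x[:, t] @ Wx + prev_h @ Wh + b)
        h[:, t] = prev_h
    return h

# Same inputs as the check cell above.
N, T, D, H = 2, 3, 4, 5
x = np.linspace(-0.1, 0.3, num=N * T * D).reshape(N, T, D)
h0 = np.linspace(-0.3, 0.1, num=N * H).reshape(N, H)
Wx = np.linspace(-0.2, 0.4, num=D * H).reshape(D, H)
Wh = np.linspace(-0.4, 0.1, num=H * H).reshape(H, H)
b = np.linspace(-0.7, 0.1, num=H)

h = rnn_forward_sketch(x, h0, Wx, Wh, b)
print(h.shape)
```

Note that the same Wx, Wh, and b are reused at every timestep; only the hidden state changes.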
Vanilla RNN: backward
In the file cs6353/rnn_layers.py, implement the backward pass for a vanilla RNN in the function rnn_backward. This should run back-propagation over the entire sequence, making calls to the rnn_step_backward function that you defined earlier. You should see errors on the order of e-6 or less.
np.random.seed(231)
N, D, T, H = 2, 5, 10, 5
x = np.random.randn(N, T, D)
h0 = np.random.randn(N, H)
Wx = np.random.randn(D, H)
Wh = np.random.randn(H, H)
b = np.random.randn(H)
out, cache = rnn_forward(x, h0, Wx, Wh, b)
dout = np.random.randn(*out.shape)
dx, dh0, dWx, dWh, db = rnn_backward(dout, cache)
fx = lambda x: rnn_forward(x, h0, Wx, Wh, b)[0]
fh0 = lambda h0: rnn_forward(x, h0, Wx, Wh, b)[0]
fWx = lambda Wx: rnn_forward(x, h0, Wx, Wh, b)[0]
fWh = lambda Wh: rnn_forward(x, h0, Wx, Wh, b)[0]
fb = lambda b: rnn_forward(x, h0, Wx, Wh, b)[0]
dx_num = eval_numerical_gradient_array(fx, x, dout)
dh0_num = eval_numerical_gradient_array(fh0, h0, dout)
dWx_num = eval_numerical_gradient_array(fWx, Wx, dout)
dWh_num = eval_numerical_gradient_array(fWh, Wh, dout)
db_num = eval_numerical_gradient_array(fb, b, dout)
print('dx error: ', rel_error(dx_num, dx))
print('dh0 error: ', rel_error(dh0_num, dh0))
print('dWx error: ', rel_error(dWx_num, dWx))
print('dWh error: ', rel_error(dWh_num, dWh))
print('db error: ', rel_error(db_num, db))
dx error: 3.84928063719157e-09
dh0 error: 1.020473174359301e-10
dWx error: 1.7230110684806883e-10
dWh error: 2.4102509807628146e-09
db error: 7.937656148540516e-09
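Back-propagation through the sequence walks the timesteps in reverse. At each step the gradient reaching h_t is the sum of the upstream gradient for that timestep and the gradient flowing back from step t+1, and the weight gradients accumulate across steps because the weights are shared. The sketch below illustrates this under the tanh recurrence assumed earlier; all function names are hypothetical, and the check uses a hand-rolled central difference rather than the course's gradient checker.

```python
import numpy as np

def step_forward(x, prev_h, Wx, Wh, b):
    next_h = np.tanh(x @ Wx + prev_h @ Wh + b)
    return next_h, (x, prev_h, Wx, Wh, next_h)

def step_backward(dnext_h, cache):
    x, prev_h, Wx, Wh, next_h = cache
    da = dnext_h * (1.0 - next_h ** 2)
    return da @ Wx.T, da @ Wh.T, x.T @ da, prev_h.T @ da, da.sum(axis=0)

def rnn_forward_sketch(x, h0, Wx, Wh, b):
    hs, caches, prev_h = [], [], h0
    for t in range(x.shape[1]):
        prev_h, cache = step_forward(x[:, t], prev_h, Wx, Wh, b)
        hs.append(prev_h)
        caches.append(cache)
    return np.stack(hs, axis=1), caches

def rnn_backward_sketch(dh, caches):
    N, T, H = dh.shape
    D = caches[0][0].shape[1]
    dx = np.zeros((N, T, D))
    dWx, dWh, db = np.zeros((D, H)), np.zeros((H, H)), np.zeros(H)
    dprev_h = np.zeros((N, H))  # nothing flows back into the last step from the future
    for t in reversed(range(T)):
        # Total gradient at h_t: upstream gradient plus gradient from step t+1.
        dx_t, dprev_h, dWx_t, dWh_t, db_t = step_backward(dh[:, t] + dprev_h, caches[t])
        dx[:, t] = dx_t
        dWx += dWx_t  # shared weights: gradients accumulate over time
        dWh += dWh_t
        db += db_t
    return dx, dprev_h, dWx, dWh, db

rng = np.random.default_rng(1)
N, T, D, H = 2, 5, 3, 4
x, h0 = rng.standard_normal((N, T, D)), rng.standard_normal((N, H))
Wx, Wh, b = 0.5 * rng.standard_normal((D, H)), 0.5 * rng.standard_normal((H, H)), rng.standard_normal(H)

h, caches = rnn_forward_sketch(x, h0, Wx, Wh, b)
dout = rng.standard_normal(h.shape)
dx, dh0, dWx, dWh, db = rnn_backward_sketch(dout, caches)

def numeric_grad(a, eps=1e-5):
    """Central-difference gradient of sum(h * dout) w.r.t. array a (perturbed in place)."""
    grad = np.zeros_like(a)
    for ix in np.ndindex(*a.shape):
        old = a[ix]
        a[ix] = old + eps
        pos = np.sum(rnn_forward_sketch(x, h0, Wx, Wh, b)[0] * dout)
        a[ix] = old - eps
        neg = np.sum(rnn_forward_sketch(x, h0, Wx, Wh, b)[0] * dout)
        a[ix] = old
        grad[ix] = (pos - neg) / (2 * eps)
    return grad

print(np.abs(dWh - numeric_grad(Wh)).max(), np.abs(dh0 - numeric_grad(h0)).max())
```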
Word embedding: forward
In deep learning systems, we commonly represent words using vectors. Each word of the vocabulary will be associated with a vector, and these vectors will be learned jointly with the rest of the system.
In the file cs6353/rnn_layers.py, implement the function word_embedding_forward to convert words (represented by integers) into vectors. Run the following to check your implementation. You should see an error on the order of e-8 or less.
N, T, V, D = 2, 4, 5, 3
x = np.asarray([[0, 3, 1, 2], [2, 1, 0, 3]])
W = np.linspace(0, 1, num=V*D).reshape(V, D)
out, _ = word_embedding_forward(x, W)
expected_out = np.asarray([
[[ 0., 0.07142857, 0.14285714],
[ 0.64285714, 0.71428571, 0.78571429],
[ 0.21428571, 0.28571429, 0.35714286],
[ 0.42857143, 0.5, 0.57142857]],
[[ 0.42857143, 0.5, 0.57142857],
[ 0.21428571, 0.28571429, 0.35714286],
[ 0., 0.07142857, 0.14285714],
[ 0.64285714, 0.71428571, 0.78571429]]])
print('out error: ', rel_error(expected_out, out))
out error: 1.0000000094736443e-08
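A word embedding lookup is just row selection: each integer ID picks one row of the embedding matrix. The sketch below shows that idea with NumPy integer-array indexing, using the same inputs as the check cell; `word_embedding_forward_sketch` is a hypothetical name and this is one possible formulation, not necessarily the one the assignment expects.

```python
import numpy as np

def word_embedding_forward_sketch(x, W):
    """x: (N, T) integer word IDs, W: (V, D) embedding matrix -> (N, T, D) vectors."""
    return W[x]  # integer-array indexing selects one row of W per word ID

N, T, V, D = 2, 4, 5, 3
x = np.asarray([[0, 3, 1, 2], [2, 1, 0, 3]])
W = np.linspace(0, 1, num=V * D).reshape(V, D)

out = word_embedding_forward_sketch(x, W)
print(out.shape)
print(out[0, 1], W[3])  # position (0, 1) holds word ID 3, so it equals row 3 of W
```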
Word embedding: backward
Implement the backward pass for the word embedding function in the function word_embedding_backward. After doing so run the following to numerically gradient check your implementation. You should see an error on the order of e-11 or less.
np.random.seed(231)
N, T, V, D = 50, 3, 5, 6
x = np.random.randint(V, size=(N, T))
W = np.random.randn(V, D)
out, cache = word_embedding_forward(x, W)
dout = np.random.randn(*out.shape)
dW = word_embedding_backward(dout, cache)
f = lambda W: word_embedding_forward(x, W)[0]
dW_num = eval_numerical_gradient_array(f, W, dout)
print('dW error: ', rel_error(dW, dW_num))
dW error: 3.2774595693100364e-12
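In the backward pass, every occurrence of a word must add its upstream gradient to that word's row of dW. A plain `dW[x] += dout` does not accumulate correctly when an ID appears more than once, whereas `np.add.at` performs an unbuffered scatter-add. The sketch below compares the scatter-add with an explicit loop; the function name and its signature (taking `x` and `vocab_size` directly instead of a cache) are assumptions for illustration.

```python
import numpy as np

def word_embedding_backward_sketch(dout, x, vocab_size):
    """dout: (N, T, D) upstream gradients, x: (N, T) word IDs -> dW of shape (V, D)."""
    dW = np.zeros((vocab_size, dout.shape[-1]), dtype=dout.dtype)
    np.add.at(dW, x, dout)  # repeated IDs accumulate instead of overwriting
    return dW

rng = np.random.default_rng(0)
N, T, V, D = 50, 3, 5, 6
x = rng.integers(V, size=(N, T))
dout = rng.standard_normal((N, T, D))
dW = word_embedding_backward_sketch(dout, x, V)

# Reference: accumulate one (example, timestep) pair at a time.
dW_loop = np.zeros((V, D))
for n in range(N):
    for t in range(T):
        dW_loop[x[n, t]] += dout[n, t]

print(np.allclose(dW, dW_loop))
```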
Temporal Affine layer
At every timestep we use an affine function to transform the RNN hidden vector at that timestep into scores for each word in the vocabulary. Because this is very similar to the affine layer that you implemented in assignment 3, we have provided this function for you in the temporal_affine_forward and temporal_affine_backward functions in the file cs6353/rnn_layers.py. Run the following to perform numeric gradient checking on the implementation. You should see errors on the order of e-9 or less.
np.random.seed(231)
# Gradient check for temporal affine layer
N, T, D, M = 2, 3, 4, 5
x = np.random.randn(N, T, D)
w = np.random.randn(D, M)
b = np.random.randn(M)
out, cache = temporal_affine_forward(x, w, b)
dout = np.random.randn(*out.shape)
fx = lambda x: temporal_affine_forward(x, w, b)[0]
fw = lambda w: temporal_affine_forward(x, w, b)[0]
fb = lambda b: temporal_affine_forward(x, w, b)[0]
dx_num = eval_numerical_gradient_array(fx, x, dout)
dw_num = eval_numerical_gradient_array(fw, w, dout)
db_num = eval_numerical_gradient_array(fb, b, dout)
dx, dw, db = temporal_affine_backward(dout, cache)
print('dx error: ', rel_error(dx_num, dx))
print('dw error: ', rel_error(dw_num, dw))
print('db error: ', rel_error(db_num, db))
dx error: 2.9215945034030545e-10
dw error: 1.5772088618663602e-10
db error: 3.252200556967514e-11
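One common way to apply the same affine map at every timestep is to fold the time dimension into the batch dimension, do a single matrix multiply, and unfold the result. The sketch below illustrates that reshape trick and compares it with an explicit einsum; it is an illustration of the idea, not a copy of the provided temporal_affine_forward, and the function name is hypothetical.

```python
import numpy as np

def temporal_affine_forward_sketch(x, w, b):
    """x: (N, T, D), w: (D, M), b: (M,) -> scores of shape (N, T, M)."""
    N, T, D = x.shape
    M = b.shape[0]
    # Treat the N*T hidden vectors as one big batch for a single matrix multiply.
    return (x.reshape(N * T, D) @ w).reshape(N, T, M) + b

rng = np.random.default_rng(0)
N, T, D, M = 2, 3, 4, 5
x = rng.standard_normal((N, T, D))
w = rng.standard_normal((D, M))
b = rng.standard_normal(M)

out = temporal_affine_forward_sketch(x, w, b)
reference = np.einsum('ntd,dm->ntm', x, w) + b
print(out.shape, np.allclose(out, reference))
```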
Temporal Softmax loss
In an RNN language model, at every timestep we produce a score for each word in the vocabulary. We know the ground-truth word at each timestep, so we use a softmax loss function to compute loss and gradient at each timestep. We sum the losses over time and average them over the minibatch.
However there is one wrinkle: since we operate over minibatches and different captions may have different lengths, we append <NULL> tokens to the end of each caption so they all have the same length. We don't want these <NULL> tokens to count toward the loss or gradient, so in addition to scores and ground-truth labels our loss function also accepts a mask array that tells it which elements of the scores count towards the loss.
Since this is very similar to the softmax loss function you implemented in assignment 2, we have implemented this loss function for you; look at the temporal_softmax_loss function in the file cs6353/rnn_layers.py.
Run the following cell to sanity check the loss and perform numeric gradient checking on the function. You should see an error for dx on the order of e-7 or less.
# Sanity check for temporal softmax loss
from cs6353.rnn_layers import temporal_softmax_loss
N, T, V = 100, 1, 10
def check_loss(N, T, V, p):
    x = 0.001 * np.random.randn(N, T, V)
    y = np.random.randint(V, size=(N, T))
    mask = np.random.rand(N, T) <= p
    print(temporal_softmax_loss(x, y, mask)[0])
check_loss(100, 1, 10, 1.0) # Should be about 2.3
check_loss(100, 10, 10, 1.0) # Should be about 23
check_loss(5000, 10, 10, 0.1) # Should be about 2.3
# Gradient check for temporal softmax loss
N, T, V = 7, 8, 9
x = np.random.randn(N, T, V)
y = np.random.randint(V, size=(N, T))
mask = (np.random.rand(N, T) > 0.5)
loss, dx = temporal_softmax_loss(x, y, mask, verbose=False)
dx_num = eval_numerical_gradient(lambda x: temporal_softmax_loss(x, y, mask)[0], x, verbose=False)
print('dx error: ', rel_error(dx, dx_num))
2.3027781774290146
23.025985953127226
2.2643611790293394
dx error: 2.583585303524283e-08
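The sanity-check values above follow from the definition: with near-zero scores the softmax is close to uniform, so each unmasked timestep contributes about ln(V), and ln(10) is about 2.30. Summing over T = 10 unmasked steps gives roughly 23, and masking all but about 10% of them brings it back to roughly 2.3. The sketch below is a minimal masked softmax loss written from the description in the text (sum over time, average over the minibatch); it is an assumption-laden illustration, not the provided temporal_softmax_loss.

```python
import numpy as np

def temporal_softmax_loss_sketch(x, y, mask):
    """x: (N, T, V) scores, y: (N, T) labels, mask: (N, T) bool -> (loss, dx)."""
    N, T, V = x.shape
    x_flat = x.reshape(N * T, V)
    y_flat = y.reshape(N * T)
    mask_flat = mask.reshape(N * T).astype(x.dtype)
    # Numerically stable log-softmax.
    shifted = x_flat - x_flat.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    rows = np.arange(N * T)
    # Sum over unmasked timesteps, average over the N examples.
    loss = -np.sum(mask_flat * log_probs[rows, y_flat]) / N
    dx = np.exp(log_probs)
    dx[rows, y_flat] -= 1.0
    dx *= mask_flat[:, None] / N  # masked positions receive zero gradient
    return loss, dx.reshape(N, T, V)

rng = np.random.default_rng(0)
N, T, V = 4, 3, 10
x = np.zeros((N, T, V))  # all-zero scores give an exactly uniform softmax
y = rng.integers(V, size=(N, T))

full_mask = np.ones((N, T), dtype=bool)
loss, dx = temporal_softmax_loss_sketch(x, y, full_mask)
print(loss, T * np.log(V))

first_step_only = np.zeros((N, T), dtype=bool)
first_step_only[:, 0] = True
loss_masked, dx_masked = temporal_softmax_loss_sketch(x, y, first_step_only)
print(loss_masked, np.log(V))
```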
RNN for image captioning
Now that you have implemented the necessary layers, you can combine them to build an image captioning model. Open the file cs6353/classifiers/rnn.py and look at the CaptioningRNN class.
Implement the forward and backward pass of the model in the loss function. For now you only need to implement the case where cell_type='rnn' for vanilla RNNs; you will implement the LSTM case later. After doing so, run the following to check your forward pass using a small test case; you should see an error of about 0.02 or less.
N, D, W, H = 10, 20, 40, 40
word_to_idx = {'<NULL>': 0, 'cat': 2, 'dog': 3}
V = len(word_to_idx)
T = 13
model = CaptioningRNN(word_to_idx,
          input_dim=D,
          wordvec_dim=W,
          hidden_dim=H,
          cell_type='rnn',
          dtype=np.float64)
# Set all model parameters to fixed values
for k, v in model.params.items():
    model.params[k] = np.linspace(-1.4, 1.3, num=v.size).reshape(*v.shape)
features = np.linspace(-1.5, 0.3, num=(N * D)).reshape(N, D)
captions = (np.arange(N * T) % V).reshape(N, T)
loss, grads = model.loss(features, captions)
expected_loss = 9.83235591003
print('loss: ', loss)
print('expected loss: ', expected_loss)
print('difference: ', abs(loss - expected_loss))
loss: 9.809174730925443
expected loss: 9.83235591003
difference: 0.023181179104556193
Run the following cell to perform numeric gradient checking on the CaptioningRNN class; you should see errors around the order of e-6 or less.
np.random.seed(231)
batch_size = 2
timesteps = 3
input_dim = 4
wordvec_dim = 6
hidden_dim = 6
word_to_idx = {'<NULL>': 0, 'cat': 2, 'dog': 3}
vocab_size = len(word_to_idx)
captions = np.random.randint(vocab_size, size=(batch_size, timesteps))
features = np.random.randn(batch_size, input_dim)
model = CaptioningRNN(word_to_idx,
          input_dim=input_dim,
          wordvec_dim=wordvec_dim,
          hidden_dim=hidden_dim,
          cell_type='rnn',
          dtype=np.float64,
        )
loss, grads = model.loss(features, captions)
for param_name in sorted(grads):
    f = lambda _: model.loss(features, captions)[0]
    param_grad_num = eval_numerical_gradient(f, model.params[param_name], verbose=False, h=1e-6)
    e = rel_error(param_grad_num, grads[param_name])
    print('%s relative error: %e' % (param_name, e))
W_embed relative error: 1.350162e-09
W_proj relative error: 7.760852e-09
W_vocab relative error: 1.879471e-09
Wh relative error: 8.772596e-09
Wx relative error: 4.146389e-07
b relative error: 5.270458e-10
b_proj relative error: 4.936156e-09
b_vocab relative error: 3.619788e-10
Overfit small data
Similar to the Solver class that we used to train image classification models in the previous assignment, in this assignment we use a CaptioningSolver class to train image captioning models. Open the file cs6353/captioning_solver.py and read through the CaptioningSolver class; it should look very familiar.
Once you have familiarized yourself with the API, run the following to make sure your model overfits a small sample of 50 training examples. You should see a final loss very close to 0.1.
np.random.seed(231)
small_data = load_coco_data(max_train=50)
small_rnn_model = CaptioningRNN(
          cell_type='rnn',
          word_to_idx=data['word_to_idx'],
          input_dim=data['train_features'].shape[1],
          hidden_dim=512,
          wordvec_dim=512,
        )
small_rnn_solver = CaptioningSolver(small_rnn_model, small_data,
           update_rule='adam',
           num_epochs=50,
           batch_size=25,
           optim_config={
             'learning_rate': 5e-3,
           },
           lr_decay=0.95,
           verbose=True, print_every=10,
         )
small_rnn_solver.train()
# Plot the training losses
plt.plot(small_rnn_solver.loss_history)
plt.xlabel('Iteration')
plt.ylabel('Loss')
plt.title('Training loss history')
plt.show()
(Iteration 1 / 100) loss: 78.117267
(Iteration 11 / 100) loss: 22.548865
(Iteration 21 / 100) loss: 5.637183
(Iteration 31 / 100) loss: 1.219399
(Iteration 41 / 100) loss: 0.518500
(Iteration 51 / 100) loss: 0.217734
(Iteration 61 / 100) loss: 0.170852
(Iteration 71 / 100) loss: 0.156351
(Iteration 81 / 100) loss: 0.145743
(Iteration 91 / 100) loss: 0.102868
Test-time sampling
Unlike classification models, image captioning models behave very differently at training time and at test time. At training time, we have access to the ground-truth caption, so we feed ground-truth words as input to the RNN at each timestep. At test time, we sample from the distribution over the vocabulary at each timestep, and feed the sample as input to the RNN at the next timestep.
In the file cs6353/classifiers/rnn.py, implement the sample method for test-time sampling. After doing so, run the following to sample from your overfitted model on both training and validation data. The samples on training data should be very good; the samples on validation data probably won't make sense.
for split in ['train', 'val']:
    minibatch = sample_coco_minibatch(small_data, split=split, batch_size=2)
    gt_captions, features, urls = minibatch
    gt_captions = decode_captions(gt_captions, data['idx_to_word'])
    sample_captions = small_rnn_model.sample(features)
    sample_captions = decode_captions(sample_captions, data['idx_to_word'])
    for gt_caption, sample_caption, url in zip(gt_captions, sample_captions, urls):
        plt.imshow(image_from_url(url))
        plt.title('%s\n%s\nGT:%s' % (split, sample_caption, gt_caption))
        plt.axis('off')
        plt.show()
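The test-time loop described above can be sketched as follows: project the image features to an initial hidden state, start from the <START> token, and at each step feed the previously generated word back in as the next input. This sketch picks the highest-scoring word at each step (greedy decoding) rather than drawing a random sample, purely to keep the output deterministic. The parameter names follow the keys printed by the gradient check; the shapes, the random toy values, and the function name `greedy_sample_sketch` are assumptions for illustration.

```python
import numpy as np

def greedy_sample_sketch(features, params, start_idx, max_length=5):
    """features: (N, D) image features -> (N, max_length) integer word IDs."""
    N = features.shape[0]
    # The initial hidden state comes from an affine projection of the image features.
    h = features @ params['W_proj'] + params['b_proj']
    word = np.full(N, start_idx, dtype=np.int64)  # every caption starts with <START>
    captions = np.zeros((N, max_length), dtype=np.int64)
    for t in range(max_length):
        emb = params['W_embed'][word]  # embed the previous word
        h = np.tanh(emb @ params['Wx'] + h @ params['Wh'] + params['b'])
        scores = h @ params['W_vocab'] + params['b_vocab']
        word = scores.argmax(axis=1)  # greedy choice; fed back in at step t + 1
        captions[:, t] = word
    return captions

# Toy dimensions and random parameters (hypothetical values, not a trained model).
rng = np.random.default_rng(0)
N, D, W, H, V = 2, 3, 4, 5, 6
params = {
    'W_proj': rng.standard_normal((D, H)), 'b_proj': np.zeros(H),
    'W_embed': rng.standard_normal((V, W)),
    'Wx': rng.standard_normal((W, H)), 'Wh': rng.standard_normal((H, H)), 'b': np.zeros(H),
    'W_vocab': rng.standard_normal((H, V)), 'b_vocab': np.zeros(V),
}
features = rng.standard_normal((N, D))
captions = greedy_sample_sketch(features, params, start_idx=1)
print(captions.shape)
```

The key difference from training is that no ground-truth caption is available, so each step depends on the model's own previous output.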
INLINE QUESTION 1
In our current image captioning setup, our RNN language model produces a word at every timestep as its output. However, an alternate way to pose the problem is to train the network to operate over characters (e.g. 'a', 'b', etc.) as opposed to words, so that at every timestep it receives the previous character as input and tries to predict the next character in the sequence. For example, the network might generate a caption like
'A', ' ', 'c', 'a', 't', ' ', 'o', 'n', ' ', 'a', ' ', 'b', 'e', 'd'
Can you describe one advantage of an image-captioning model that uses a character-level RNN? Can you also describe one disadvantage? HINT: there are several valid answers, but it might be useful to compare the parameter space of word-level and character-level models.
Answer:
Advantage: Flexibility in Generating New Words
- A character-level RNN can create words it has never seen before because it works with individual letters instead of whole words. This is useful when the model needs to handle rare or completely new words, like names or technical terms, that a word-based model might not know.
Disadvantage: Longer Sequences and Slower Training
- Since it generates text one letter at a time, the sequences are much longer than in word-based models, which makes training slower and makes it harder for the RNN to learn meaningful patterns. The RNN might therefore need more layers and a larger hidden state to capture the dependencies between characters.
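The hint about parameter space can be made concrete with a rough count. Only the embedding matrix and the hidden-to-vocabulary layer depend on vocabulary size, so the sketch below counts just those, using the 1004-word vocabulary and the 512-dimensional settings from this notebook. The character vocabulary size of 100 is an assumed round number, and `vocab_dependent_params` is a hypothetical helper.

```python
def vocab_dependent_params(vocab_size, wordvec_dim, hidden_dim):
    """Count parameters whose size scales with the vocabulary."""
    embed = vocab_size * wordvec_dim                # W_embed
    output = hidden_dim * vocab_size + vocab_size   # W_vocab and b_vocab
    return embed + output

word_level = vocab_dependent_params(1004, 512, 512)
char_level = vocab_dependent_params(100, 512, 512)  # 100 characters is an assumption
print(word_level, char_level)

# The trade-off: the same caption needs many more timesteps at character level.
caption = 'A cat on a bed'
print(len(caption.split()), 'word steps vs', len(caption), 'character steps')
```

Under these assumptions the character-level model has about a tenth of the vocabulary-dependent parameters but needs nearly three times as many timesteps for this caption.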